Method for anonymizing network data using differential privacy

ABSTRACT

The invention described herein is directed to methods and systems for protecting network trace data. Network traces are used for network management, packet classification, traffic engineering, tracking user behavior, identifying user behavior, analyzing network hierarchy, maintaining network security, and classifying packet flows. In some embodiments, network trace data is protected by subjecting the network trace data to data anonymization using an anonymization algorithm that simultaneously provides sufficient privacy to accommodate the organizational needs of the network trace data owner, provides acceptable data utility to accommodate management and/or network investigative needs, and provides efficient data analysis.

PRIOR FILED APPLICATIONS

This application claims priority benefit to U.S. patent application Ser. No. 62,892,726 entitled “A Method for Anonymizing Network Data Using Differential Privacy” filed Aug. 28, 2019, the contents of which are incorporated herein in their entirety.

STATEMENT REGARDING FEDERALLY SPONSORED R&D

The invention was made with U.S. Government support under Grant No. 11183 awarded by the MITRE-USM FFRDC. The U.S. Government has certain rights in the invention.

BACKGROUND

The embodiments described herein relate generally to network security, and particularly to the protection of network trace data.

Network security is one of the most significant issues for any organization, and network trace data is a primary asset that needs to be protected. It can be used in several tasks, such as network management, packet classification, traffic engineering, and tracking user behavior. However, these tasks are routinely performed by external organizations. Releasing network trace data to external entities is a very sensitive issue for any organization, and it is often prohibited because sharing such data exposes critical information of the organization, such as IP addresses, user IDs, passwords, host addresses, emails, personal web pages, and even authentication keys.

Accordingly, a need exists for methods and systems for protecting sensitive network trace data.

BRIEF SUMMARY OF THE INVENTION

The embodiments described herein are directed to methods and systems for protecting network trace data.

Network traces are used for network management, packet classification, traffic engineering, tracking user behavior, identifying user behavior, analyzing network hierarchy, maintaining network security, and classifying packet flows.

In some embodiments, network trace data is protected by subjecting the network trace data to data anonymization using an anonymization algorithm that simultaneously provides sufficient privacy to accommodate the organizational needs of the network trace data owner, provides acceptable data utility to accommodate management and/or network investigative needs, and provides efficient data analysis.

In some embodiments described herein, the systems and methods provide a condensation-based differential privacy anonymization method that achieves an improved tradeoff between privacy and utility compared to existing techniques and produces anonymized network trace data that can be shared without lowering its utility value.

In some embodiments, the method does not incur extra computation overhead for the data analyzer. In some implementations, the systems and methods have been shown to preserve privacy and to allow data analysis without revealing the original data even when injection attacks are launched against it. In some embodiments, the systems and methods are capable of providing identical intrusion detection rates for the anonymized datasets compared to original datasets of network trace data.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an illustration of the approach and overall system architecture.

FIG. 2 is an illustration of packet aggregation to create flows.

FIG. 3 is an illustration of anonymization of IP addresses.

FIG. 4 is an illustration of anonymization of non-IP features using an enhanced condensation approach.

FIG. 5 is an illustration of a cluster-based differential privacy anonymization example.

FIG. 6 is an illustration of an injection attack scenario and its recovery.

FIG. 7 is an illustration of privacy results of condensation-based anonymization techniques using Dataset 1.

FIG. 8 is an illustration of privacy results of condensation-based anonymization techniques using Dataset 2.

FIG. 9 is an illustration of privacy results of different anonymization methods with Dataset 1.

FIG. 10 is an illustration of privacy results of different anonymization methods with Dataset 2.

FIG. 11 is an illustration of an SLN before anonymization.

FIG. 12 is an illustration of an SLN after anonymization.

FIG. 13a is an illustration of precision values for SLNs, original vs. anonymized.

FIG. 13b is an illustration of recall values for SLNs, original vs. anonymized Dataset 1.

FIG. 13c is an illustration of accuracy values for SLNs, original vs. anonymized Dataset 1.

FIG. 13d is an illustration of F-measure for SLNs, original vs. anonymized Dataset 1.

FIG. 13e is an illustration of the ROC curve for SLNs, original vs. anonymized Dataset 1.

FIG. 14a is an illustration of testing data injection attacks using various anonymization policies on Dataset 1.

FIG. 14b is an illustration of testing data injection attacks using various anonymization policies on Dataset 2.

DETAILED DESCRIPTION

Disclosed Embodiments are Directed to

Any of the methods and systems described herein can provide wherein

Definitions

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the full scope of the claims. Unless defined otherwise, all technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art. Nothing in this disclosure is to be construed as an admission that the embodiments described in this disclosure are not entitled to antedate such disclosure by virtue of prior invention.

As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity.

In general, terms used herein, and especially in the appended claims (e.g., bodies of the appended claims) are generally intended as “open” terms (e.g., the term “including” should be interpreted as “including but not limited to,” the term “having” should be interpreted as “having at least,” etc.). Similarly, the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers (or fractions thereof), steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers (or fractions thereof), steps, operations, elements, components, and/or groups thereof. As used in this document, the term “comprising” means “including, but not limited to.”

As used herein the term “and/or” includes any and all combinations of one or more of the associated listed items. It should be understood that virtually any disjunctive word and/or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms, or both terms. For example, the phrase “A or B” will be understood to include the possibilities of “A” or “B” or “A and B.”

All ranges disclosed herein also encompass any and all possible subranges and combinations of subranges thereof unless expressly stated otherwise. Any listed range should be recognized as sufficiently describing and enabling the same range being broken down into at least equal subparts unless expressly stated otherwise. As will be understood by one skilled in the art, a range includes each individual member.

Definitions

Anonymization is a process of excluding sensitive identifiers from a dataset but keeping its statistical characteristics so that it can still be useful for analysis and scientific research by external entities [20]. A network trace is the flow of packets between a sender and a receiver. It contains attributes such as source and destination addresses, source and destination port numbers, MAC address, timestamp, packet length, protocol, payload, etc. Adversaries may use some of these attributes to either identify end points (e.g., addresses), reveal user behavior (e.g., payload), or inject certain information (e.g., timestamp) that can be easily tracked.

Anonymization algorithms vary based on the level of anonymization performed on the data, and the type of features sanitized. For instance, enumeration can only be applied to numeric attributes in sequential order (e.g., IP address). Enumeration sorts the values of an attribute in ascending order and adds a value that is larger than the original one. This technique can be applied to all attributes [1].

A scheme may be proposed to anonymize network traces by shifting each value with a random offset, to replace the original value. A random permutation scheme can be used with timestamps. It is a one-to-one mapping process and is mostly applied to IP and MAC addresses.

Permutation requires two tables, one for maintaining the mapping from non-anonymized to anonymized IP addresses, and another to store anonymized IP addresses. Prefix-preserving pseudonymization works in the same manner. It concentrates on satisfying the following rule: if two original IP addresses share a k-bit prefix, their anonymized versions must also share a k-bit prefix. Xu et al. [7] released a trace dataset using a prefix-preserving technique. Overall, prefix preservation has greater utility than the Black Marker anonymization scheme, since the latter replaces IP values with constants in a way that significantly degrades the statistical characteristics of a network trace [1, 21, 22]. Black Marker has the same effect as simply printing the log and blacking out all IP addresses. This method loses all IP address information and is completely irreversible. While this method is simple, it is quite undesirable because it does not allow correlation of events perpetrated against a single host. Sequential numbering could be used by adding sequence numbers to distinguish the attribute values according to their order, but it requires large storage to maintain consistency of data.
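By way of a non-limiting illustration, the prefix-preserving rule described above can be sketched as follows. The sketch flips each address bit according to a keyed pseudo-random function of the bits that precede it, so two addresses that agree on their first k bits are transformed identically over those k bits. The use of HMAC-SHA256 as the pseudo-random function, the function names `prefix_preserving` and `common_prefix_len`, and the key value are assumptions made for this example only; they are not taken from the cited tools.

```python
import hashlib
import hmac
import ipaddress


def prefix_preserving(ip: str, key: bytes) -> str:
    """Map an IPv4 address so that shared k-bit prefixes stay shared."""
    bits = format(int(ipaddress.IPv4Address(ip)), "032b")
    out = []
    for i in range(32):
        # The flip decision for bit i depends only on the i bits before it,
        # so inputs that agree on a prefix receive identical flips there.
        digest = hmac.new(key, bits[:i].encode(), hashlib.sha256).digest()
        flip = digest[0] & 1
        out.append(str(int(bits[i]) ^ flip))
    return str(ipaddress.IPv4Address(int("".join(out), 2)))


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits on which two IPv4 addresses agree."""
    x = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - x.bit_length()
```

Under these assumptions, two hosts in the same /24 subnet whose last octets differ in the first bit still share exactly 24 leading bits after anonymization, which is the property that keeps subnet-level analysis possible.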

Hashing addresses this limitation by using a cryptographic hashing function. In general, the generated hash value is smaller than the original attribute value, so this may facilitate dictionary attacks, which can be avoided by combining the hash function with a secret key using the Hash Message Authentication Code (HMAC) algorithm. These methods are used for anonymizing URLs and filenames. Sequence numbering works faster and results in a shorter trace than keyed hashing, except that it can only be executed on a single system, whereas keyed hashing is proven and used in distributed systems. Partitioning creates equivalence relations and canonical examples for a set of values, and then assigns an anonymized value whose range is within the corresponding partition. Truncation anonymizes IP and MAC addresses by deleting the least significant bits and keeping the most significant bits. This technique can make an end point non-identifiable [23]. For string attributes, constant substitution can be used. The original data is replaced by a constant to add confidentiality to sensitive attributes [22]. Applying it to the identity attributes results in indistinguishable data. Shuffling re-arranges pieces of data (e.g., within an attribute). Generalization approaches replace a data value by a more general data value [21]. In the k-anonymity model, the attributes are divided into sensitive, non-sensitive, and quasi-identifier attributes. Several equivalence classes are created by hiding the values of quasi-identifiers, such that the quasi-identifier attributes of any record are similar to those of at least k-1 other records. The k-anonymity model has some limitations with regard to the diversity of sensitive attributes. Therefore, the l-diversity model requires the equivalence classes to have at least l distinct values for sensitive attributes [25]. Both k-anonymity and l-diversity show good privacy protection on categorical attributes, but they lead to information loss when numerical attributes are anonymized.

Micro-aggregation techniques are comparable to k-anonymity techniques as they work mainly on numerical attributes. The records are clustered such that each cluster includes at least k records. However, the features are replaced with values that represent information about the cluster itself [26]. Micro-aggregation techniques cluster the records in the dataset so that the similarity among data points inside a cluster is maximized, while the similarity among data points in different clusters is minimized. The quasi-identifier values are masked in a way that is relevant to the cluster itself, e.g., they can be replaced with the cluster average. Data generalization approaches are applied to network traces by ‘partitioning’ information (also called ‘grouping’ or ‘binning’), e.g., grouping TCP/UDP port numbers by assigning a fixed value to each group [11].
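As a non-limiting illustration of two of the techniques surveyed above, the following sketch shows keyed hashing with HMAC and truncation of an IPv4 address. The function names `keyed_pseudonym` and `truncate_ip`, the choice of SHA-256, the 16-character output length, and the default of 24 retained bits are assumptions chosen for the example.

```python
import hashlib
import hmac
import ipaddress


def keyed_pseudonym(value: str, key: bytes, length: int = 16) -> str:
    """Replace a value with a keyed hash (HMAC-SHA256), hex-encoded and shortened."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:length]


def truncate_ip(ip: str, keep_bits: int = 24) -> str:
    """Keep the most significant bits of an IPv4 address and zero the rest."""
    mask = (0xFFFFFFFF << (32 - keep_bits)) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(int(ipaddress.IPv4Address(ip)) & mask))
```

Because the pseudonym depends on the secret key, an adversary who does not hold the key cannot precompute a dictionary of hashed URLs or filenames. With the default setting, `truncate_ip("10.1.2.77")` yields `10.1.2.0`, so every host in that subnet becomes indistinguishable.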

A multi-view solution to address the problem of injection attacks may be proposed. This solution is based on the following basic method: after an initial application of prefix-preserving encryption on IP addresses, the data owner divides the data into partitions and generates a vector V of size d (d being the number of partitions) of randomly generated integers in the range of 1 to N. V is multiplied by a random number c to obtain a vector V′. The data owner then applies prefix-preserving encryption −V′ times on each partition. Here the negative sign means that prefix-preserving encryption is applied in a reverse fashion. The data owner sends this anonymized data set, along with the boundaries of the partitions and the vector V, to the data analyzer. Since the random number c is withheld from the data analyzer, the data analyzer cannot recover the original IP addresses. However, the data analyzer can apply prefix-preserving encryption V, 2V, . . . , NV times on each partition. It is clear that one of these views (the one for which cV is applied) recovers the original anonymized data (before applying the V′ encryptions). The data analyzer then analyzes all N versions of the anonymized data, and the data owner can obliviously retrieve the correct version of the results. The basic method is then extended to provide more protection; however, N views will still be generated and the data analyzer has to spend N times more effort to analyze the data. Given that network trace data is often quite large, and N is also quite large (from 20 to 160 in the paper), this method is often impractical.

A modified condensation-based anonymization algorithm for network trace data may be proposed [27]. This algorithm optimizes the tradeoff between privacy protection and utility preservation, and it achieves much better privacy protection and utility preservation than existing anonymization techniques. There has been some research on anonymizing system logs [28, 29]. However, log anonymization has shortcomings similar to those of trace anonymization: 1) the methods only remove identifying information such as IP addresses or user names but are still vulnerable to injection attacks based on other attributes, or they add significant computational burden to the data analyzer; 2) they do not optimize the tradeoff between privacy protection and utility preservation.

B. Requirements for Anonymization and Existing Tools

Anonymization tools should satisfy a few requirements to maintain the value of traces. The first is the pseudonym consistency requirement, which means that it is necessary to maintain consistency among anonymizations of each distinct IP address or hardware address within a trace or between different traces that belong to the same organization. The second requirement is to perform a systematic sanitization of the transport, network, and data link layer header information in a trace, while eliminating the payload.

Different tools provide different trade-offs between privacy and information loss. Some tools work only on network layer information, while others work on cross-layer packet anonymization [22]. Tcpdpriv removes private information from network traces using a prefix-preserving anonymization technique [30]. Xu et al. [7] improved tcpdpriv by using a cryptography-based prefix-preserving anonymization technique. Cryptopan can be used with parallel and distributed processing of traces [31]. It also meets the pseudonym consistency requirement. Fan et al. [32] evaluated the tool and found that attacks are still possible based on trace type. Slagell et al. [13] suggested an improved version of Cryptopan that performs prefix-preserving IP address pseudonymization. Slagell also proposed FLAIM [21], a tool that is not tied to the specific log being anonymized and supports multi-level anonymization. Gamer et al. introduced Pktanon [33], a tool that achieves flexibility, extensibility and privacy; it allows arbitrary anonymization for every protocol field, and uses a defensive transformation technique to prevent privacy violations. Ipsumdump is a tool that translates tcpdump files into ASCII format to be easily readable by programs. It relies on prefix-preserving pseudonymization techniques [34]. Koukis et al. developed an Anonymization Application Programming Interface (AAPI) tool, so users can write their own anonymization function and choose the appropriate policy for each attribute [15]. Foukarakis et al. [35] developed ANONTOOL based on AAPI, a command line tool that can generate synthetic data for both online and offline traces.

However, all these tools are still vulnerable to injection attacks on anonymized data, which are described in the next section.

C. Attacks on Anonymized Data

Several attacks are initiated on anonymized data to reveal or infer sensitive sanitized information, such as identifying network topology [36], or discovering user behavior [9, 15, 37].

In general, there are two types of attacks: inspection attacks and injection attacks. In inspection attacks the attacker is not authorized and only has information from the trace and observation, while in injection attacks the attacker is authorized and has knowledge about the injected pattern that no one else has [6]. In injection attacks, the attacker injects specific data into traffic. When the dataset is anonymized and released to the public, the attacker's goal is to identify the injected pattern and thereby discover the binding between the original and anonymized data. For example, an attacker may inject a sequence of packets with certain patterns (e.g., specific source or destination port numbers, specific delay between packets, or specific packet sizes). The attacker may recognize these patterns in the anonymized data and, through reverse engineering, may uncover the original data. Gattani [8] showed that injection attacks are only possible when sufficient knowledge is available on when, how and where the trace is collected. So the best countermeasure against this attack is to keep such information private [7]. On the other hand, eliminating the data generated by scanners that probe active hosts prior to anonymization may provide a good approach to protect against active attacks. Injection attacks are performed either by injecting complex patterns within a short time or by injecting simple patterns over long time periods. Experiments that injected five different types of patterns with different complexity concluded that it is difficult to protect data from injection attacks using traditional anonymization approaches. These experiments empirically demonstrated the effectiveness of this attack against prefix-preserving anonymization and suggested remedies that might limit its damaging capability. However, it is quite difficult to stop such an attack without continuous human investigation. They also found that anonymizing IP addresses by assigning unique and static values to IP addresses via pseudonymization does not guarantee adequate privacy and immunity against packet injection attacks.

In structure recognition attacks, the objective is to exploit the structure among objects to infer their identities. For example, traces of Internet traffic will often include sequential address scans made by attackers probing for vulnerable hosts [38]. There are attacks that aim to recover IP addresses anonymized using prefix-preserving anonymization techniques [38]. Those attacks exploit shared-text matching for cascading effects, with the shared text being the prefix.

Next, we describe our novel approach that generates an anonymization model with strong privacy protection. In particular, we adapt an improved version of K-anonymity [24] that incorporates a differential privacy approach.

III. Approach

Most existing anonymization techniques only encrypt the IP addresses in the data set, but they are vulnerable to injection attacks, where a large fraction of injected packet patterns can be recovered even after permutation of IP addresses, bucketization of ports, and addition of random noise to time, number of packets, and packet size.

FIG. 1 depicts our system architecture and flow of information. Network data is collected from network data sources. Adversaries may have injected certain traffic patterns as well. Our anonymization algorithms are applied to anonymize the collected data set, and the anonymized data is sent to data recipients. We evaluate our approach based on privacy, utility (measured as accuracy of attack detection), and whether injected patterns can be recovered.

Next we describe the sensitive attributes we need to anonymize in Section A. Section B describes pre-processing steps to collect labeled network flow data. Section C presents our anonymization method for IP addresses. Section D reviews an existing anonymization method called condensation. Section E presents an enhanced condensation method to anonymize non-IP features. This method supports K-anonymity. Section F proposes a condensation-based differential-privacy anonymization method. Section G describes how to test injection attacks and Section H discusses how to test our approach on sophisticated intrusion detection methods.

A. Sensitive Attributes

Data features in network data sources, when shared, may reveal the network architecture, user identity, and user information; therefore, it is essential to identify sensitive information. Based on our review, we identified several sensitive attributes that need to be protected.

IP addresses: They are considered one of the most important attributes to be anonymized. An attacker relies mainly on discovering the mapping of IP addresses to detect the host and network. For example, a source IP address indicates a user's IP, which may reveal the user identity. The destination IP address may be used by attackers to launch attacks. In addition, IP addresses may be used in intrusion detection algorithms as well. For example, if we know attacks often originate from specific IP addresses then we can be suspicious of possible attacks from these IP addresses. Similarly, if we know there have been attacks against a certain host (destination IP), then we can pay more attention to packets having that host as a destination IP. Thus we need to carefully balance the need for privacy protection and intrusion detection when anonymizing IP addresses.

Timestamps: They do not indicate any user information; however, they could be used in data injection attacks through discovering the mapping of information that is already known prior to anonymization. In addition, a timestamp may denote a specific action with respect to response delay and inter-flow time, which could be matched with already known values.

Port Numbers: They partially characterize the applications that are used to create the trace and may be used in fingerprinting attacks to reveal that a certain application with suspected vulnerabilities is running on the network from which the trace is collected.

Trace Counters: They indicate the number of packets per flow. This attribute may be used for fingerprinting and injection attacks.

B. Labeled Network Flow Data

In this paper we focus on anonymizing network flows with header information. Each flow contains aggregated information from network packets that have common features, e.g., the same source and destination IPs, the same protocol, etc. FIG. 2 shows an example of packets that are aggregated into four flows based on common features.

Packets that are close to each other in time and destined to the same location are aggregated into a single flow.

Data about alerts raised by Intrusion Detection Systems (IDSs) are extracted and correlated with the corresponding flows. For example, if a flow contains information about packets associated with alerts, then the flow is automatically labeled as suspicious; otherwise it is labeled as normal (more details on this correlation approach are discussed in our previous work [39]). The result of this process is a labeled dataset of network flows.
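The aggregation and labeling steps described in this section can be sketched, in a non-limiting manner, as follows. The packet representation (a dictionary with `id`, `ts`, `src`, `dst`, `proto`, and `len` fields), the `Flow` record, the flow key of source, destination, and protocol, and the one-second inactivity gap are all assumptions made for the example; the correlation approach of [39] is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    key: tuple            # (source IP, destination IP, protocol)
    start: float          # timestamp of the first packet
    end: float            # timestamp of the most recent packet
    packets: int = 0
    size: int = 0         # total bytes carried by the flow
    label: str = "normal"


def aggregate(packets, alert_ids, gap=1.0):
    """Group packets into flows and label each flow from IDS alerts."""
    open_flows = {}   # key -> most recent flow for that key
    flows = []
    for pkt in sorted(packets, key=lambda p: p["ts"]):
        key = (pkt["src"], pkt["dst"], pkt["proto"])
        flow = open_flows.get(key)
        # Start a new flow when the key is new or the flow has gone quiet.
        if flow is None or pkt["ts"] - flow.end > gap:
            flow = Flow(key=key, start=pkt["ts"], end=pkt["ts"])
            open_flows[key] = flow
            flows.append(flow)
        flow.end = pkt["ts"]
        flow.packets += 1
        flow.size += pkt["len"]
        # One alerted packet is enough to mark the whole flow.
        if pkt["id"] in alert_ids:
            flow.label = "suspicious"
    return flows
```

The per-flow packet count and byte total computed here correspond to the trace counters identified above as sensitive attributes, which is why they are later subject to anonymization as well.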

C. Anonymization of IP Addresses

The following two stages describe anonymization of IP addresses: (1) First, encrypt/permute the leading digits of the IP addresses (network number). Intrusion detection methods can still use the leading portion of the IP addresses. Attackers may discover the subnet but the next stage prevents identifying the host. (2) Then, for the remaining digits of the IP (host number part), we cluster these addresses, and randomize addresses in the same cluster (exact IP address cannot be located).

Algorithm 1 summarizes the steps needed to anonymize the IP address. The dataset D is divided into n datasets, such that D_i contains flows with label L_i where each label can be an attack or a benign activity. Then the leading digits of the IP addresses (network number) are permuted using a prefix-preserving permutation function. The IP addresses are then clustered into k clusters based on their least significant digits (host number). The average for the least significant digits of IPs in the same cluster (host number) is calculated. Then the least significant digits of IPs (the last three digits) in each cluster are replaced using the computed average. FIG. 3 shows an example of IP anonymization into two clusters.

Algorithm 1: IP Address Anonymization (D, k)
1. Divide dataset D into n datasets, such that D_i contains records with label L_i.
2. For i = 1 to n do
      For j = 1 to |IP| do
         Permute the leading digits of the IP addresses (network number) using a prefix-preserving permutation function.
      End for
   3. Cluster IP addresses into k clusters based on their least significant digits (host number).
   4. Sort the clusters in ascending order of cluster size. Let them be C_1, C_2, ..., C_k.
   5. For each cluster C_j that contains fewer than k records,
      A. Find k − |C_j| records closest to the center of C_j that lie in clusters that contain more than k records.
      B. Move these records to C_j.
   6. For each cluster C_j do
      a. Compute the mean of the least significant digits of the IPs in the same cluster (host number).
      b. Replace the least significant digits of the IPs in each cluster using the computed statistics.
      End for
   End for
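Steps 3 and 6 of Algorithm 1 can be sketched, by way of a non-limiting example, as follows. The sketch assumes that the host number is the last octet of an IPv4 address, uses a plain one-dimensional k-means written for the example, and omits the prefix permutation of step 2 and the rebalancing of step 5. The names `kmeans_1d` and `anonymize_hosts`, the fixed seed, and the iteration count are hypothetical, and k must not exceed the number of distinct host numbers.

```python
import random


def kmeans_1d(values, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on scalars; returns a cluster index per value."""
    rng = random.Random(seed)
    centers = rng.sample(sorted(set(values)), k)
    assign = [0] * len(values)
    for _ in range(iters):
        # Assign every value to its nearest center, then recompute centers.
        assign = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assign


def anonymize_hosts(ips, k):
    """Replace the last octet of each IPv4 address with its cluster mean."""
    hosts = [int(ip.rsplit(".", 1)[1]) for ip in ips]
    assign = kmeans_1d(hosts, k)
    means = {}
    for c in set(assign):
        members = [h for h, a in zip(hosts, assign) if a == c]
        means[c] = round(sum(members) / len(members))
    return [f"{ip.rsplit('.', 1)[0]}.{means[a]}" for ip, a in zip(ips, assign)]
```

After this step every address in a cluster carries the same host number, so an observer can tell which cluster a flow belongs to but not which host within the cluster produced it.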

D. Data Anonymization Using Condensation

A heuristic condensation algorithm by Aggarwal and Yu uses the statistical characteristics of the original data in order to generate synthetic data while preserving its statistical characteristics [40]. Other anonymization algorithms are limited to noise addition to the data, which may lead to an insufficient privacy level.

The original condensation algorithm clusters records in the dataset such that the similarity among data points inside the clusters is maximized and the similarity among data points in different clusters is minimized. Each cluster contains at least k records, and the quasi-identifier values are masked in a way that is relevant to the cluster; for example, they can be replaced with the cluster averages.

E. Anonymizing Non-IP Features Using Modified Condensation

We utilize a condensation-based approach to perform anonymization on non-IP features. We apply two modifications to the original condensation algorithm.

First, we implement a per-class condensation mechanism on network traces. The original condensation algorithm does not consider the differences between classes to perform the de-identification. In general, there is a significant difference between the behavior of network attackers and other users, and such differences need to be captured in the anonymized data.

Second, the original condensation algorithm picks cluster centers randomly, which may lead to inferior clusters. Instead, we utilize the k-means clustering algorithm, which is relatively efficient in terms of within-class variance [2].

FIG. 4 shows the steps to anonymize non-IP features. The clusters are sorted in ascending order of cluster size. For each cluster C_j that contains fewer than k records, the k − |C_j| records closest to the center of C_j are selected from clusters that contain more than k records. The selected records are then moved to C_j. For each cluster C_j, the data is shifted into a new space using PCA. In the new space, Z_1, Z_2, ..., Z_p are independent components. Then, random data Z_i′ with statistical features similar to those of Z_i is generated. Z_1′, Z_2′, ..., Z_p′ are combined into one dataset Z′. Finally, Z′ is shifted back to the original data space using reverse PCA.
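The per-cluster PCA step described above may be written out as follows. This is one possible formalization offered for illustration, not a statement of the exact procedure in FIG. 4. The symbols μ (cluster mean), Σ (cluster covariance), V (matrix of eigenvectors), Λ (diagonal matrix of eigenvalues λ_i), and X (matrix whose rows are the records of C_j) are introduced here as assumptions, as is the choice to match only the mean and variance of each component.

```latex
\begin{aligned}
\mu &= \frac{1}{|C_j|}\sum_{x \in C_j} x, \qquad
\Sigma = \frac{1}{|C_j|}\sum_{x \in C_j} (x-\mu)(x-\mu)^{\top} = V \Lambda V^{\top}, \\
Z &= (X - \mathbf{1}\mu^{\top})\,V, \qquad \operatorname{Var}(Z_i) = \lambda_i, \\
Z_i' &\sim \text{a distribution with mean } 0 \text{ and variance } \lambda_i, \quad i = 1,\dots,p, \\
X' &= Z' V^{\top} + \mathbf{1}\mu^{\top}.
\end{aligned}
```

Because the components Z_i are uncorrelated, each Z_i′ can be drawn independently, and mapping Z′ back through V^T restores the cluster mean and covariance in the original feature space.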

F. Condensation-Based Differential-Privacy Anonymization Method

The differential privacy methods introduced by Dwork [18] provide stronger privacy protection than K-anonymity. To protect the sensitive information, differential privacy methods systematically add a random number generated from a special distribution centered at zero to the results of all data queries. Differential privacy mechanisms ensure that the addition or removal of a single database record has no significant effect on the outcome of any analysis performed on the database. The idea of preserving the privacy of network traffic using a noisy version of the true answer is not new; however, the way noise is added is different in the case of differential privacy.

Our differential privacy approach works as follows. First, we implement a prefix-preserving technique to anonymize IP addresses. We permute the n leading digits (network part). For the remaining digits (host part), we cluster these addresses into K clusters. Then we randomize addresses within the same cluster. Second, we implement a per-class differential privacy mechanism. Third, we utilize differential privacy in our condensation method. Our method clusters records based on the features of network trace data, so that each cluster has packets or flows with similar features. We then compute the mean of these features and add Laplace noise to the mean. The perturbed mean replaces the original values. FIG. 5 illustrates our cluster-based differential privacy anonymization method. At first we partition the data into three clusters, which are displayed in different font types in the table. Then we compute the mean of each cluster and add Laplace noise to the mean. The final step is to replace every original value with this perturbed mean.

Algorithm 2: Differentially Private Condensation of Network Data (Dataset D, k)
1. Divide D into n datasets, such that D_i contains records with label L_i. Let n_i be the size of D_i.
2. For i = 1 to n do
   a. Run k-means clustering on D_i to generate [n_i/k] clusters.
   b. Sort the clusters in ascending order of cluster size. Let them be C_1, C_2, ..., C_k.
   c. For each cluster C_j that contains fewer than k records, find k − |C_j| records closest to the center of C_j that lie in clusters that contain more than k records.
   d. Move these records to C_j.
   e. For each cluster C_j
      1. Synthetic data generation: compute the mean of each feature and the corresponding Laplace noise.
      2. Replace the data values with the perturbed mean.
      End for
   End for
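Steps b through d of Algorithm 2 can be sketched, in a non-limiting manner, as follows. Records are assumed to be numeric tuples, closeness is assumed to be Euclidean distance to the undersized cluster's original centroid, and records are taken one at a time only from clusters that currently hold more than k records, so that no donor falls below k. The function names `centroid` and `rebalance` are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def rebalance(clusters, k):
    """Top up clusters smaller than k with the nearest records from larger ones."""
    clusters = sorted((list(c) for c in clusters), key=len)
    for small in clusters:
        if not small:
            continue  # an empty cluster has no center to measure from
        center = centroid(small)
        while len(small) < k:
            # Candidate records: every record in a cluster that can spare one.
            donors = [(math.dist(p, center), i, j)
                      for i, c in enumerate(clusters) if len(c) > k
                      for j, p in enumerate(c)]
            if not donors:
                break  # no cluster can spare a record
            _, i, j = min(donors)
            small.append(clusters[i].pop(j))
    return clusters
```

The returned list keeps the clusters in ascending order of their original size, which matches the ordering C_1, C_2, ..., C_k used in step b.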

The added noise follows a Laplace distribution with mean zero and standard deviation = sensitivity/ε, where sensitivity = (max value in cluster − min value in cluster)/cluster size.

Epsilon (ε) is a small constant (usually around 0.01). From this definition, the larger the cluster size, the smaller the noise, so this method works better for large volumes of data. Algorithm 2 shows the steps to achieve differential privacy on all features except IP addresses, which are anonymized based on an IP prefix-preserving method. First, the dataset is divided into subsets, each containing instances with an identical class label. Then we utilize the K-means clustering method to generate clusters per subset. Since we may end up with some clusters having fewer than k points and some having more, some points are moved from large clusters to small clusters. Then we compute the mean of the features in each cluster and add Laplace noise to the mean. The perturbed mean replaces the original values.
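The replacement step at the end of Algorithm 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiments: the clusters are taken as already formed (the k-means and record-moving steps are omitted), the quantity sensitivity/epsilon is used as the scale parameter of the Laplace distribution, and the function names and flow values are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Zero-mean Laplace sample, drawn as the difference of two exponentials."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def condense_cluster(cluster, epsilon, rng):
    """Replace every record in one cluster with the perturbed per-feature mean."""
    size = len(cluster)
    perturbed = []
    for column in zip(*cluster):  # one tuple of values per feature
        mean = sum(column) / size
        # Sensitivity as defined in the text: (max - min) / cluster size.
        sensitivity = (max(column) - min(column)) / size
        # Assumption: sensitivity / epsilon is the Laplace scale parameter.
        perturbed.append(mean + laplace_noise(sensitivity / epsilon, rng))
    return [list(perturbed) for _ in range(size)]


def condense(clusters, epsilon, rng):
    """Apply the replacement step to every cluster of one class-label subset."""
    return [condense_cluster(cluster, epsilon, rng) for cluster in clusters]


rng = random.Random(7)  # fixed seed only so the sketch is repeatable
# Hypothetical flows: [packets, octets, duration]
clusters = [
    [[1, 160, 10], [2, 180, 12], [1, 170, 11]],
    [[50, 1208, 150], [52, 1216, 160], [51, 1224, 155]],
]
anonymized = condense(clusters, epsilon=0.01, rng=rng)
```

Because the sensitivity is divided by the cluster size, a cluster with more records receives less noise for the same spread of values, which matches the observation that the method works better on large volumes of data.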

Testing injection attacks on data anonymized by our algorithms

We want to investigate whether datasets that have been anonymized with differential privacy and other approaches are robust enough to withstand injection attacks. Table 1 shows some example patterns used to prepare the injected data. We inject similar patterns into the data before anonymization. The data is then anonymized using several permutation-based anonymization policies as well as our proposed differential privacy method. Table 2 shows the anonymization approaches listed in [6]. We then try to identify the injected patterns. We use K-NN search to recover the injected patterns from the anonymized data. Formally, knnsearch(p_i, A_i) ∀p_i finds the nearest neighbor in the anonymized data A_i for each record that represents the pattern p_i. The result of the K-NN search is a column vector where each record contains the index of the nearest neighbor in the anonymized flows for the corresponding record in the injected flows. If there is a match between the injected pattern and the nearest neighbor, the attack is considered successful. The number of recovered patterns using each anonymization policy is reported. An example of an injection attack, and how the injected pattern is recovered, is shown in FIG. 6.
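The recovery step can be sketched with a brute-force nearest-neighbor search. This is only an illustration: it assumes Euclidean distance over numeric flow features, and it assumes the evaluator knows where each injected flow ended up after anonymization (the hypothetical list `true_positions`), so that a match can be checked. The flow values are made up.

```python
import math


def knnsearch(patterns, anonymized):
    """Return, for each pattern record, the index of its nearest anonymized record."""
    return [
        min(range(len(anonymized)), key=lambda i: math.dist(p, anonymized[i]))
        for p in patterns
    ]


def count_recovered(nearest, true_positions):
    """Count injected records whose nearest neighbor is their own anonymized copy."""
    return sum(1 for found, truth in zip(nearest, true_positions) if found == truth)


# Hypothetical flows: (packets, destination port, octets)
injected = [(1, 80, 160), (5, 200, 256)]
anonymized = [(100, 5000, 900), (1, 80, 161), (5, 210, 250)]
true_positions = [1, 2]  # index of each injected flow after anonymization

nearest = knnsearch(injected, anonymized)             # [1, 2]
recovered = count_recovered(nearest, true_positions)  # 2
```

In this toy case the anonymized copies stay close to the injected values, so both flows are recovered; an anonymization that moves records far from their original values would break the match.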

TABLE 1 Patterns injected in the trace data

Pattern   Packets   Source port   Destination port   Duration       Octets
P₁        1         Fixed         80                 —              160
P₂        5         R(65k)        R(65k)             200            256
P₃        110       Fixed         80                 200            480[+32]
P₄        10        R(65k)        R(65k)             200            832[+32]
P₅        50        R(65k)        R(65k)             150 + R(300)   1208[+R(8)]

Values in square brackets denote the attribute evolution between flows.
R(x): random number between 1 and x.
Total number of injected flows is 650 (130 flows from each pattern).

TABLE 2 Anonymization policies used for testing data injection attacks

Policy   IP Addr.      Ports   Time [S]   Packets   Octets
A₁       Permutation   —       —          —         —
A₂       Permutation   —       —          O(5)      O(50)
A₃       Permutation   B(8)    O(30)      —         —
A₄       Permutation   B(2)    O(60)      —         —
A₅       Permutation   B(8)    O(30)      O(5)      O(50)
A₆       Permutation   B(2)    O(120)     O(10)     O(200)

B(x): bucketized in x buckets. O(x): added a uniform random offset between −x and +x.
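The B(x) and O(x) operations in Table 2 can be sketched as below for policy A₅. Several details are assumptions made for illustration only: ports are treated as 16-bit values split into equal-width buckets, the single "Ports" column is applied to both source and destination ports, offsets are drawn as integers, and the IP permutation is left out. The field names and the flow are hypothetical.

```python
import random

PORT_SPACE = 65536  # assumption: 16-bit ports, equal-width buckets


def bucketize(port, buckets):
    """B(x): map a port to the index of one of `buckets` equal-width buckets."""
    return port * buckets // PORT_SPACE


def offset(value, x, rng):
    """O(x): add a uniform random integer offset between -x and +x."""
    return value + rng.randint(-x, x)


def apply_policy_a5(flow, rng):
    """Policy A5 without the IP permutation: B(8), O(30), O(5), O(50)."""
    return {
        "src_port": bucketize(flow["src_port"], 8),
        "dst_port": bucketize(flow["dst_port"], 8),
        "time": offset(flow["time"], 30, rng),
        "packets": offset(flow["packets"], 5, rng),
        "octets": offset(flow["octets"], 50, rng),
    }


rng = random.Random(3)  # fixed seed only so the sketch is repeatable
flow = {"src_port": 51234, "dst_port": 80, "time": 200, "packets": 10, "octets": 832}
anonymized_flow = apply_policy_a5(flow, rng)
```

Since every offset is bounded by x, an anonymized record stays within a known distance of its original, which is one reason a nearest-neighbor search can still find injected patterns under such policies.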

H. Sophisticated Intrusion Detection Methods

An important question comes up with any anonymization method: How well do existing techniques work when they are applied on anonymized data versus the original data? In this section we provide an answer to this question by using intrusion detection methods applied on anonymized and original data. Simple classification-based intrusion detection methods generate accurate results on anonymized data sets compared with the original non-anonymized data sets. We now test our anonymization method on sophisticated intrusion detection methods such as those that rely on semantic networks [41]. Semantic networks are graphs with nodes representing attacks or benign activities and edges representing the semantic links between them; they are called semantic link networks (SLNs). The stronger the relationship between nodes, the higher the possibility that they co-occur under a particular context. Consequently, observing one suspicious node can help proactively avoid another. We generate two types of SLNs: the first is based on the original trace data, while the second is based on the anonymized data. Logged labeled flows are anonymized, and then classification techniques are applied on the anonymized flows. Then, the SLN generated from anonymized data is used to identify intrusions. Finally, we compare the SLN generated from anonymized data against the SLN built over the original data. More details on applying SLNs for intrusion detection can be found in our previous work [41].

IV. Experiments and Evaluation

In this section, we present the results of the experiments that have been conducted to test the effectiveness of our anonymization approach. Two sets of experiments are described:

The first set of experiments evaluates our approach by measuring privacy and accuracy. Two different datasets are anonymized and privacy is measured on the resulting datasets. In addition, accuracy is measured before and after anonymization. Furthermore, we measure the robustness of the approach when the anonymized data is used in sophisticated intrusion detection techniques, versus when the original non-anonymized data is used to generate such techniques;

The second set of experiments measures the immunity of the proposed techniques against data injection attacks. We measure the recovery rate of several patterns that are injected in the datasets before anonymization.

Objectives of experiments: In the experimental evaluation we show that our model:

Works reliably and is accurate enough while preserving privacy, when compared with other approaches;

Is immune against data injection attacks;

Works very well when applied on sophisticated intrusion detection techniques.

A. Datasets

We chose two datasets to run our experiments:

PREDICT Repository: PREDICT (A Protected REpository for Defense of Infrastructure against Cyber Threats) has shared real-world datasets for cyber security research to advance state-of-the-art network security research and development [42]. In our experiments we used packet captures from the 2013 National Collegiate Cyber Defense Competition (nccdc.org).

We created a software application to generate flows from packet captures and correlate the created flows with alerts generated by the Snort Intrusion Detection System [39, 43]. We generated a total of 400893 benign and suspicious flows to use in our experiments.

University of Twente Dataset: The second data set, provided by Sperotto et al., was created at the University of Twente by monitoring a honeypot for HTTP, SSH and FTP traffic [44]. We selected 401732 suspicious flows from this dataset with the corresponding alerts.

Since the PREDICT dataset contains mostly normal flows and the Twente dataset mostly attack flows, we draw a random sample from each dataset and combine them to create two new mixed datasets. The combined datasets are:

Dataset 1: 70% PREDICT dataset+30% Twente dataset

Dataset 2: 50% PREDICT dataset+50% Twente dataset

These two datasets are further partitioned into training (70%) and evaluation (30%) parts.
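The construction of the mixed datasets and their partitioning can be sketched as follows. The sketch assumes that the stated percentages are the shares of each source in the combined dataset and that the partitioning is a simple shuffled split; the function names, the total size and the placeholder records are hypothetical.

```python
import random


def mix(predict, twente, predict_share, total, rng):
    """Draw random samples from both sources and combine them in the given share."""
    n_predict = round(total * predict_share)
    sample = rng.sample(predict, n_predict) + rng.sample(twente, total - n_predict)
    rng.shuffle(sample)
    return sample


def split(records, train_share=0.7):
    """Partition records into training and evaluation parts."""
    cut = round(len(records) * train_share)
    return records[:cut], records[cut:]


rng = random.Random(11)  # fixed seed only so the sketch is repeatable
predict_flows = [("PREDICT", i) for i in range(5000)]  # placeholder records
twente_flows = [("Twente", i) for i in range(5000)]    # placeholder records

dataset1 = mix(predict_flows, twente_flows, predict_share=0.7, total=1000, rng=rng)
dataset2 = mix(predict_flows, twente_flows, predict_share=0.5, total=1000, rng=rng)
train1, eval1 = split(dataset1)
```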

B. Evaluation Measures

Accuracy: We employ several accuracy measures to validate the effectiveness of our anonymization algorithms, such as TP Rate, FP Rate, Precision, Recall, F-Measure, and ROC (Receiver Operating Characteristic) area.

Privacy: We use conditional privacy to measure the privacy of anonymized traffic data [45]. This measure depends on the mutual information between the raw and anonymized records at a certain confidence level. While information loss is related to the amount of mismatch among the records before and after anonymization, conditional privacy is based on the differential entropy of a random variable. The differential entropy of A given B=b is

$\begin{matrix}{h(A|B) = -\int_{\Omega_{A,B}} f_{A,B}(a,b)\,\log_{2} f_{A|B=b}(a)\,da\,db} & (1)\end{matrix}$

where A is a random variable that describes the data, B is a variable that gives information on A, and $\Omega_{A,B}$ identifies the domain of A and B. Therefore, the average conditional privacy of A given B is

$\begin{matrix}{\Pi(A|B) = 2^{h(A|B)}} & (2)\end{matrix}$

If $D_i$ is the attribute value of the original data and $D_i'$ is the value after anonymization, the conditional privacy for anonymizing that attribute is $2^{h(D_i|D_i')}$, where $h(D_i|D_i')$ is the conditional entropy of the original data given the anonymized data. Conditional privacy is calculated and averaged over all attributes.
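As a rough illustration of this measure, the sketch below estimates the conditional privacy of one attribute from paired original and anonymized values. It is a discrete approximation and an assumption of this example: the attribute values are treated as categorical and the integral of the differential entropy is replaced by a sum over observed value pairs. The function name and sample values are hypothetical.

```python
import math
from collections import Counter


def conditional_privacy(original, anonymized):
    """Estimate 2**H(D|D') from paired samples, with H in bits (discrete case)."""
    n = len(original)
    joint = Counter(zip(original, anonymized))   # counts of (d, d') pairs
    marginal = Counter(anonymized)               # counts of d'
    entropy = 0.0
    for (d, d_prime), count in joint.items():
        p_joint = count / n                      # p(d, d')
        p_cond = count / marginal[d_prime]       # p(d | d')
        entropy -= p_joint * math.log2(p_cond)
    return 2.0 ** entropy


# No anonymization: the released value reveals the original exactly.
identity = conditional_privacy([1, 2, 3, 4], [1, 2, 3, 4])   # 1.0
# Four distinct originals collapsed into one released value.
condensed = conditional_privacy([1, 2, 3, 4], [0, 0, 0, 0])  # 4.0
```

In this discrete setting the value can be read as the effective number of original values that remain indistinguishable after anonymization, which is why larger clusters lead to higher conditional privacy.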

C. Results

1) Privacy Results:

FIGS. 7 and 8 show the conditional privacy results for the anonymized datasets generated by the original condensation technique (described in section III-E) and our modified one (Algorithm 2). This experiment measures the effect of increasing the cluster size (k) on the values of conditional privacy when using generalization approaches such as condensation. We first utilized pure condensation without preserving the IP prefix. Then we utilized condensation while preserving the prefix of IP addresses.

Two main conclusions can be drawn from those figures. First, the values of conditional privacy get higher when we increase k. Second, pure condensation attains higher privacy values than condensation with prefix preservation. Source and destination IP addresses contribute significantly to the higher privacy values in the case of pure condensation. However, the prefixes of IP addresses are not preserved using pure condensation, which leads to more information loss and higher values of conditional privacy.

The second set of privacy experiments compares different anonymization techniques, including our differential privacy approach. FIG. 9 shows the privacy measures for Dataset 1 using different techniques. Based on this measure, our technique (Algorithm 2) performed better than most existing techniques. We utilized three types of condensation approaches. First, we performed typical condensation without preserving the IP prefix. Then, we performed typical condensation together with prefix-preserving anonymization of IP addresses. Finally, we performed our per-class condensation method with IP prefix preservation and differentially private per-class condensation with IP prefix preservation. It is observed that pure condensation attains higher privacy values than prefix-preserving condensation. The per-class condensation method with the differential privacy approach outperformed all other methods. The experimental results using Dataset 2, shown in FIG. 10, are similar, but our approach achieves much higher levels of privacy compared to the other approaches.

2) Accuracy Results

We ran several experiments to measure and compare accuracy on anonymized vs. original data. We utilize a K-Nearest Neighbors (KNN) classifier to run our experiments. Tables 3 and 4 show the KNN classification results on Dataset 1 and Dataset 2, respectively. In terms of accuracy, our approach, when applied with the prefix-preserving approach to anonymize data (Condensation-Per Class, Differential Privacy-Per Class), achieves the highest accuracy compared to other techniques. It is evident that while some techniques such as Black Marker achieve acceptable privacy levels, they lead to high information loss, as demonstrated by our privacy results. Therefore, the accuracy obtained with such approaches is low compared to other techniques. It is also noticed that there is a significant difference between approaches that belong to the same category. For instance, truncation attains higher accuracy compared to reverse truncation. Reverse truncation sets the most significant bits to zero; therefore, the predictability of features is significantly affected after anonymization. The results clearly indicate the importance of the prefix-preserving approach in decreasing the amount of information loss. Consequently, all approaches that apply our prefix-preserving algorithm attain higher accuracy values. In addition, the prefix-preserving differential privacy algorithm achieves the best results in terms of accuracy. Contrary to approaches such as Black Marker and Truncation, the results of the differential privacy algorithm are consistent across both datasets when comparing the results shown in Tables 3 and 4.

TABLE 3 Dataset 1 Experiment-KNN Classification on Anonymized and Original data

Method / Class   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area
Original
  Attack         0.98      0.013     0.981       0.98     0.98        0.984
  Normal         0.987     0.02      0.987       0.987    0.987       0.984
  Avg            0.984     0.017     0.984       0.984    0.984       0.984
Condensation-Per Class, Prefix Preserving IP
  Attack         0.941     0.059     0.961       0.941    0.951       0.941
  Normal         0.941     0.059     0.913       0.941    0.927       0.941
  Avg            0.941     0.059     0.942       0.941    0.941       0.941
Condensation-All Classes, Prefix Preserving IP
  Attack         0.628     0.582     0.62        0.628    0.624       0.523
  Normal         0.418     0.372     0.426       0.418    0.422       0.523
  Avg            0.545     0.498     0.543       0.545    0.544       0.523
Differential Privacy-Per Class, Prefix Preserving IP
  Attack         0.941     0.059     0.96        0.941    0.95        0.94
  Normal         0.941     0.059     0.913       0.941    0.927       0.94
  Avg            0.941     0.059     0.941       0.941    0.941       0.94
Pure Condensation
  Attack         0.691     0.612     0.631       0.691    0.66        0.54
  Normal         0.388     0.309     0.454       0.388    0.418       0.54
  Avg            0.571     0.491     0.56        0.571    0.564       0.54
Prefix-Preserving (IP) + Generalization (Other Features)
  Attack         1         1         0.602       1        0.752       0.5
  Normal         0         0         0           0        0           0.5
  Avg            0.602     0.602     0.362       0.602    0.452       0.5
Permutation
  Attack         0.999     1         0.602       0.999    0.751       0.5
  Normal         0         0.001     0.048       0        0           0.5
  Avg            0.602     0.602     0.381       0.602    0.452       0.5
Black Marker
  Attack         1         1         0.602       1        0.752       0.5
  Normal         0         0         0           0        0           0.5
  Avg            0.602     0.602     0.362       0.602    0.452       0.5
Truncation
  Attack         0.578     0.506     0.633       0.578    0.604       0.577
  Normal         0.494     0.422     0.436       0.494    0.463       0.577
  Avg            0.544     0.473     0.555       0.544    0.548       0.577
Reverse Truncation
  Attack         0.082     0.163     0.432       0.082    0.137       0.46
  Normal         0.837     0.918     0.376       0.837    0.519       0.46
  Avg            0.382     0.463     0.41        0.382    0.289       0.46

TABLE 4 Dataset 2 Experiment-KNN Classification on Anonymized and Original Data

Method / Class   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area
Original
  Attack         0.991     0.013     0.991       0.991    0.991       0.989
  Normal         0.987     0.009     0.987       0.987    0.987       0.989
  Avg            0.984     0.011     0.989       0.989    0.989       0.989
Condensation-Per Class, Prefix Preserving IP
  Attack         0.954     0.118     0.917       0.954    0.935       0.918
  Normal         0.882     0.046     0.934       0.882    0.907       0.918
  Avg            0.924     0.088     0.924       0.924    0.923       0.918
Condensation-All Classes, Prefix Preserving IP
  Attack         0.553     0.562     0.575       0.553    0.564       0.495
  Normal         0.438     0.447     0.416       0.438    0.427       0.495
  Avg            0.504     0.514     0.508       0.504    0.506       0.495
Differential Privacy-Per Class, Prefix Preserving IP
  Attack         0.975     0.125     0.915       0.975    0.944       0.945
  Normal         0.875     0.025     0.962       0.875    0.916       0.945
  Avg            0.933     0.083     0.935       0.933    0.932       0.945
Pure Condensation
  Attack         0.662     0.597     0.603       0.662    0.631       0.532
  Normal         0.403     0.338     0.464       0.403    0.431       0.532
  Avg            0.553     0.488     0.545       0.553    0.547       0.532
Prefix-Preserving (IP) + Generalization (Other Features)
  Attack         1         1         0.579       1        0.733       0.67
  Normal         0         0         0           0        0           0.67
  Avg            0.579     0.579     0.335       0.579    0.424       0.67
Permutation
  Attack         0.083     0.31      0.27        0.083    0.127       0.387
  Normal         0.69      0.917     0.354       0.69     0.468       0.387
  Avg            0.339     0.566     0.305       0.339    0.271       0.387
Black Marker
  Attack         0         0         0           0        0           0.5
  Normal         1         1         0.421       1        0.593       0.5
  Avg            0.421     0.421     0.178       0.421    0.25        0.5
Truncation
  Attack         0.499     0.396     0.634       0.499    0.559       0.595
  Normal         0.604     0.501     0.468       0.604    0.527       0.595
  Avg            0.544     0.44      0.564       0.544    0.546       0.595
Reverse Truncation
  Attack         0.906     0.9       0.58        0.906    0.708       0.503
  Normal         0.1       0.094     0.437       0.1      0.163       0.503
  Avg            0.567     0.56      0.52        0.567    0.478       0.503

3) Results on Sophisticated Intrusion Detection Techniques

In this set of experiments we evaluate the two types of SLNs created using the original and the anonymized datasets. Different intrusion detection evaluation metrics are used to measure the success rate of identifying attacks using SLNs when applied on top of K-NN classification techniques.

The two types of initial SLNs created before and after anonymization are shown in FIGS. 11 and 12. The two SLNs have exactly the same structure. In addition, the strengths of relationships between attacks (values on the graph edges) are very close on the SLNs before and after anonymization. When SLNs are used for attack detection purposes, they typically increase recall values. We experimented with SLNs using different values of a threshold t, which specifies the minimum cutoff value for including nodes relevant to the starting node initially predicted by the K-NN classifier. We compare the accuracy of identifying attacks using SLNs on anonymized and original datasets in terms of Precision, Recall, F-measure and Receiver Operating Characteristic (ROC). The ROC is a popular measure that has been used to compare intrusion detection techniques; it plots the TP and FP rates associated with various operating points when different intrusion detection techniques are used. The values of the TP and FP rates (TPR and FPR) are calculated as:

$\begin{matrix}{{TPR} = \frac{TP}{{TP} + {FN}}} & (3) \\{{FPR} = \frac{FP}{{FP} + {TN}}} & (4)\end{matrix}$
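Equations (3) and (4), together with the Precision, Recall and F-measure values used in the comparison, can be computed from predicted and actual labels as in the sketch below. The label names, function names and sample labels are illustrative assumptions; Recall is taken to be identical to TPR, and a ratio with a zero denominator is reported as 0.

```python
def confusion_counts(actual, predicted, positive="Attack"):
    """Count TP, FP, TN and FN for the chosen positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Safe division: report 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(actual, predicted, positive="Attack"):
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    tpr = ratio(tp, tp + fn)            # equation (3); same as Recall
    fpr = ratio(fp, fp + tn)            # equation (4)
    precision = ratio(tp, tp + fp)
    f_measure = ratio(2 * precision * tpr, precision + tpr)
    return {"TPR": tpr, "FPR": fpr, "Precision": precision,
            "Recall": tpr, "F-Measure": f_measure}


# Hypothetical labels for five flows
actual = ["Attack", "Attack", "Attack", "Normal", "Normal"]
predicted = ["Attack", "Attack", "Normal", "Attack", "Normal"]
metrics = detection_metrics(actual, predicted)
```

For these five flows, TP = 2, FN = 1, FP = 1 and TN = 1, so TPR is 2/3 and FPR is 1/2.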

The results of this experiment are shown in FIGS. 13a-13e. The results of the experiments on Dataset 1 show that there are no significant differences in Precision, Recall, and F-measure values before and after anonymization. The ROC curves shown in FIG. 13e are clearly close, and the differences between the original and anonymized datasets are very small in terms of TPR and FPR. Results on the second dataset are very similar to those on the first dataset, so they are omitted.

4) Results on Injection Attacks

We simulate injection attacks by adding records with specific patterns to the original datasets. Then we run the anonymization algorithms on the two datasets and try to identify the injected records. We then compare the Injected Pattern Recovery Rate (IPRR) on various anonymization policies using the following formula:

$\begin{matrix}{{I\; P\; R\; R} = \frac{{Recovered}\mspace{14mu}{injected}\mspace{14mu}{pattern}}{{Total}\mspace{14mu}{number}\mspace{14mu}{of}\mspace{14mu}{injected}\mspace{14mu}{patterns}}} & (5)\end{matrix}$
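Equation (5) can be evaluated per pattern and over all injected flows as in the short sketch below. The recovery counts are invented solely to show the arithmetic and are not results from the experiments; the names are hypothetical.

```python
def iprr(recovered, injected):
    """Injected Pattern Recovery Rate: recovered patterns over injected patterns."""
    if injected == 0:
        raise ValueError("no patterns were injected")
    return recovered / injected


FLOWS_PER_PATTERN = 130  # 5 patterns x 130 flows = 650 injected flows

# Made-up recovery counts, used only to illustrate the calculation
recovered_per_pattern = {"P1": 130, "P2": 13, "P3": 65, "P4": 0, "P5": 0}

per_pattern = {
    name: iprr(count, FLOWS_PER_PATTERN)
    for name, count in recovered_per_pattern.items()
}
overall = iprr(
    sum(recovered_per_pattern.values()),
    FLOWS_PER_PATTERN * len(recovered_per_pattern),
)
```

With these illustrative counts, 208 of the 650 injected flows are recovered, giving an overall IPRR of 0.32.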

We applied five anonymization policies (A_1-A_5) in addition to our differential privacy approach on both datasets. Those policies are described in Table 2. In addition, the five patterns described in section III-F (Table 1) are injected in both datasets. Patterns 1, 2 and 3 are simpler than patterns 4 and 5. Each pattern consists of 130 records, for a total of 650 injection attempts. Those patterns work as a fingerprint in the original data to be discovered later, after anonymization. The objective of this process is to determine the immunity of the anonymization algorithms against injection attacks [6]. The results of this experiment on both datasets are shown in FIGS. 14a and 14b. It is clear that permutation-based anonymization policies lead to the highest recovery ratio compared to other approaches; KNN search discovers the majority of records for patterns 1 and 3. As the complexity of the permutation function used in the anonymization policy increases, IPRR values decrease. However, KNN search still discovers a significant percentage of the injected patterns. Compared to these anonymization policies, when our differential privacy-based anonymization policy is used, zero records are recovered, testifying to the robustness of our approach against injection attacks.

V. Conclusion and Future Work

Embodiments herein provide a method that utilizes differential privacy to anonymize network traces and has the following characteristics: it has a very strong privacy guarantee; it is robust when used in generating attack prediction models, even when sophisticated intrusion detection techniques such as graph-based approaches are used; and it does not add any burden to the data analyser, who can analyse the data as it is. Our experiments show that using differential privacy for anonymization produces superior results compared to existing techniques in terms of the privacy-utility trade-off.

The embodiments herein, and/or the various features or advantageous details thereof, are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the inventive concepts to those skilled in the art. Like numbers refer to like elements throughout.

In a preferred embodiment, the system may be described using functional block diagrams to describe a machine in the example form of a computer system, within which a set of instructions for causing the machine to perform any one or more of the methodologies, processes or functions discussed herein may be executed. In some examples, the machine is a plurality of devices in communication with a Server as described above. The machine operates as either a server or a client machine in a client-server network environment when each device is connected to the Server in the cloud. The machine may be any special-purpose machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine for performing the functions described herein. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

Example computer systems may include a processor, memory, data storage and a communication interface, which may communicate with each other via a data and control bus. In some examples, the computer system also includes a display and/or user interface.

The processor may include, without being limited to, a microprocessor, a central processing unit, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP) and/or a network processor. The processor may be configured to execute processing logic for performing the operations described herein. In general, the processor may include any suitable special-purpose processing device specially programmed with processing logic to perform the operations described herein.

The memory may include, for example, without being limited to, at least one of a read-only memory (ROM), a random access memory (RAM), a flash memory, a dynamic RAM (DRAM) and a static RAM (SRAM), storing computer-readable instructions executable by the processing device. In general, the memory may include any suitable non-transitory computer readable storage medium storing computer-readable instructions executable by the processing device for performing the operations described herein. In some examples, the computer system may include two or more memory devices (e.g., dynamic memory and static memory).

The computer system may include a communication interface device for direct communication with other computers (including wired and/or wireless communication) and/or for communication with a network. In some examples, the computer system may include a display device (e.g., a liquid crystal display (LCD), a touch sensitive display, etc.). In some examples, the computer system may include a user interface (e.g., a touchscreen, keyboard, an alphanumeric input device, a cursor control device, etc.).

In some examples, the computer system may include a data storage device storing instructions (e.g., software) for performing any one or more of the functions described herein. The data storage device may include any suitable non-transitory computer-readable storage medium, including, without being limited to, solid-state memories, optical media and magnetic media.

Various implementations of the systems and techniques described here may be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language.

As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal.

The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here may be implemented on a computer having a display device for displaying information to the user and a user interface, such as a touchscreen, stylus pencil, voice command, keyboard or pointing device (e.g., a mouse or a trackball), by which the user may provide input to the computer. Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here may be implemented in a computing system that includes a back end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user may interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components.

The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet. The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention.

While illustrative embodiments have been described herein, the scope thereof includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments), adaptations and/or alterations as would be appreciated by those in the art based on the present disclosure. For example, the number and orientation of components shown in the exemplary systems may be modified.

1: A method, comprising: using differential privacy to anonymize network traces; providing a very strong privacy guarantee; generating robust attack prediction models even when sophisticated intrusion detection techniques such as graph-based approaches are used; wherein the method does not add any burden to a data analyser, such that the data analyser can analyse data without modification; wherein the computing device each has a memory and a hardware processor, and programming instructions saved to the memory and executable on the hardware processor for performing the steps to effect the method steps.

2: A computer-implemented system for performing the methods herein, comprising: a computing device having a memory and a hardware processor and program instructions saved to the memory and executable by the processor for running an application configured to perform the method steps.